Local SGD Converges Fast and Communicates Little
Mini-batch stochastic gradient descent (SGD) is state of the art in large
scale distributed training. The scheme can reach a linear speedup with respect
to the number of workers, but this is rarely seen in practice as the scheme
often suffers from large network delays and bandwidth limits. To overcome this
communication bottleneck recent works propose to reduce the communication
frequency. An algorithm of this type is local SGD that runs SGD independently
in parallel on different workers and averages the sequences only once in a
while.
This scheme shows promising results in practice, but has eluded thorough
theoretical analysis. We prove concise convergence rates for local SGD on
convex problems and show that it converges at the same rate as mini-batch SGD
in terms of number of evaluated gradients, that is, the scheme achieves linear
speedup in the number of workers and mini-batch size. The number of
communication rounds can be reduced by up to a factor of T^{1/2}---where T denotes
the total number of steps---compared to mini-batch SGD. This also holds for
asynchronous implementations. Local SGD can also be used for large scale
training of deep learning models.
The results shown here aim to serve as a guideline for further exploring the
theoretical and practical aspects of local SGD in these applications.
Comment: to appear at ICLR 2019, 19 pages
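To illustrate the algorithm the abstract describes, below is a minimal sketch of local SGD on a toy least-squares problem. The worker count, step size, synchronization period H, and the data split are illustrative assumptions, not values from the paper.

```python
# Minimal local SGD sketch: each worker runs SGD on its own data shard and the
# iterates are averaged every H local steps (one communication round).
import numpy as np

rng = np.random.default_rng(0)
n, d, workers = 1000, 10, 4                       # illustrative sizes
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Split the data evenly across workers.
A_parts = np.array_split(A, workers)
b_parts = np.array_split(b, workers)

def local_sgd(T=2000, H=20, lr=0.01):
    """Run T total steps; average the worker iterates every H local steps."""
    x = np.zeros((workers, d))                    # one local iterate per worker
    for t in range(T):
        for w in range(workers):
            i = rng.integers(len(b_parts[w]))     # sample one local data point
            a_i, b_i = A_parts[w][i], b_parts[w][i]
            grad = (a_i @ x[w] - b_i) * a_i       # stochastic gradient
            x[w] -= lr * grad                     # local SGD step
        if (t + 1) % H == 0:                      # communication round
            x[:] = x.mean(axis=0)                 # average and broadcast
    return x.mean(axis=0)

x_hat = local_sgd()
print("distance to x_true:", np.linalg.norm(x_hat - x_true))
```

Every H local steps costs one communication round, so increasing H trades communication for potentially slower per-step progress.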
Variable Metric Random Pursuit
We consider unconstrained randomized optimization of smooth convex objective
functions in the gradient-free setting. We analyze Random Pursuit (RP)
algorithms with fixed (F-RP) and variable metric (V-RP). The algorithms only
use zeroth-order information about the objective function and compute an
approximate solution by repeated optimization over randomly chosen
one-dimensional subspaces. The distribution of search directions is dictated by
the chosen metric.
Variable Metric RP uses novel variants of a randomized zeroth-order Hessian
approximation scheme recently introduced by Leventhal and Lewis (D. Leventhal
and A. S. Lewis, Optimization 60(3), 329--345, 2011). We here present (i) a
refined analysis of the expected single step progress of RP algorithms and
their global convergence on (strictly) convex functions and (ii) novel
convergence bounds for V-RP on strongly convex functions. We also quantify how
well the employed metric needs to match the local geometry of the function in
order for the RP algorithms to converge with the best possible rate.
Our theoretical results are accompanied by numerical experiments, comparing
V-RP with the derivative-free schemes CMA-ES, Implicit Filtering, Nelder-Mead,
NEWUOA, Pattern-Search, and Nesterov's gradient-free algorithms.
Comment: 42 pages, 6 figures, 15 tables, submitted to journal, Version 3:
majorly revised second part, i.e. Section 5 and Appendix
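As a rough illustration of the scheme described above, here is a minimal sketch of fixed-metric Random Pursuit on a convex quadratic, using only function values and an off-the-shelf 1-D line search. The test function and the use of scipy's minimize_scalar are assumptions for the example; the variable-metric variant would additionally adapt the search-direction covariance via the zeroth-order Hessian approximation mentioned in the abstract.

```python
# Minimal fixed-metric Random Pursuit sketch: repeatedly minimize the objective
# along a randomly drawn direction, using only zeroth-order information.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
d = 20
Q = np.diag(np.linspace(1.0, 10.0, d))           # mildly ill-conditioned quadratic
f = lambda x: 0.5 * x @ Q @ x                    # accessed only via function values

def random_pursuit(x0, iters=3000, metric=None):
    """Optimize over random one-dimensional subspaces drawn from N(0, metric)."""
    x = x0.copy()
    C = np.eye(d) if metric is None else metric  # search-direction covariance
    L = np.linalg.cholesky(C)
    for _ in range(iters):
        u = L @ rng.standard_normal(d)           # random search direction
        # Approximate 1-D line search along u, using function values only.
        res = minimize_scalar(lambda t: f(x + t * u))
        x = x + res.x * u
    return x

x_hat = random_pursuit(rng.standard_normal(d))
print("f(x_hat) =", f(x_hat))
```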
Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction
The recently proposed stochastic Polyak stepsize (SPS) and stochastic
line-search (SLS) for SGD have shown remarkable effectiveness when training
over-parameterized models. However, in non-interpolation settings, both
algorithms only guarantee convergence to a neighborhood of a solution which may
result in a worse output than the initial guess. While artificially decreasing
the adaptive stepsize has been proposed to address this issue (Orvieto et al.
[2022]), this approach results in slower convergence rates for convex and
over-parameterized models. In this work, we make two contributions: Firstly, we
propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which
guarantee convergence in non-interpolation settings and maintain sub-linear and
linear convergence rates for convex and strongly convex functions when training
over-parameterized models. AdaSLS requires no knowledge of problem-dependent
parameters, and AdaSPS requires only a lower bound of the optimal function
value as input. Secondly, we equip AdaSPS and AdaSLS with a novel variance
reduction technique and obtain algorithms that require \widetilde{O}(n + 1/\epsilon)
gradient evaluations to achieve an \epsilon-suboptimality for convex functions, which
improves upon the slower O(1/\epsilon^2) rates of AdaSPS and AdaSLS without
variance reduction in the non-interpolation regimes. Moreover, our result
matches the fast rates of AdaSVRG but removes the inner-outer-loop structure,
making the algorithms easier to implement and analyze. Finally, numerical experiments on
synthetic and real datasets validate our theory and demonstrate the
effectiveness and robustness of our algorithms.
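For context, the sketch below shows plain SGD with the classic stochastic Polyak stepsize (the SPS rule the paper builds on), not AdaSPS or AdaSLS themselves. The per-sample lower bound f_i^* = 0, the constant c, the cap gamma_max, and the toy logistic-regression data are illustrative assumptions.

```python
# SGD with the classic stochastic Polyak stepsize (SPS) on toy logistic regression.
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 5
X = rng.standard_normal((n, d))
y = (X @ rng.standard_normal(d) > 0).astype(float)   # labels in {0, 1}

def loss_grad(w, i):
    """Logistic loss and gradient on a single sample i."""
    z = np.clip(X[i] @ w, -30.0, 30.0)                # clip to avoid overflow
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -(y[i] * np.log(p + 1e-12) + (1 - y[i]) * np.log(1 - p + 1e-12))
    grad = (p - y[i]) * X[i]
    return loss, grad

def sgd_sps(T=5000, c=0.5, gamma_max=1.0):
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)
        loss, grad = loss_grad(w, i)
        # Polyak stepsize: (f_i(w) - f_i^*) / (c * ||grad||^2), capped at gamma_max;
        # f_i^* = 0 is assumed here for the non-negative logistic loss.
        gamma = min(loss / (c * (grad @ grad) + 1e-12), gamma_max)
        w -= gamma * grad
    return w

w_hat = sgd_sps()
print("train accuracy:", np.mean((X @ w_hat > 0) == y.astype(bool)))
```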
Spectral Preconditioning for Gradient Methods on Graded Non-convex Functions
The performance of optimization methods is often tied to the spectrum of the
objective Hessian. Yet, conventional assumptions, such as smoothness, often do
not enable us to make fine-grained convergence statements -- particularly not
for non-convex problems. Striving for a more intricate characterization of
complexity, we introduce a unique concept termed graded non-convexity. This
allows us to partition the class of non-convex problems into a nested chain of
subclasses. Interestingly, many traditional non-convex objectives, including
partially convex problems, matrix factorizations, and neural networks, fall
within these subclasses. As a second contribution, we propose gradient methods
with spectral preconditioning, which employ inexact top eigenvectors of the
Hessian to address the ill-conditioning of the problem, contingent on the
grade. Our analysis reveals that these new methods provide provably superior
convergence rates compared to basic gradient descent on applicable problem
classes, particularly when large gaps exist between the top eigenvalues of the
Hessian. Our theory is validated by numerical experiments executed on multiple
practical machine learning problems.
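As a loose illustration (not the paper's exact method), the sketch below preconditions a gradient step with the top-k Hessian eigenpairs: a curvature-scaled step in the dominant eigen-subspace and a plain 1/L step in its complement. The quadratic test problem, the choice of k, and the exact eigendecomposition are assumptions for the example; the paper works with inexact top eigenvectors.

```python
# Gradient descent with a spectral preconditioner built from the top-k
# eigenpairs of the Hessian, illustrated on a fixed quadratic with a large
# gap between the top eigenvalues and the rest.
import numpy as np

rng = np.random.default_rng(3)
d, k = 50, 5
eigs = np.concatenate([np.linspace(100.0, 500.0, k),     # few large eigenvalues
                       np.linspace(0.1, 1.0, d - k)])    # many small ones
U = np.linalg.qr(rng.standard_normal((d, d)))[0]
H = U @ np.diag(eigs) @ U.T                               # fixed Hessian
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

def spectral_precond_gd(x0, iters=200):
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        # Top-k eigenpairs of the Hessian (computed exactly here; for a general
        # objective they would be re-estimated, possibly inexactly, at each iterate).
        vals, vecs = np.linalg.eigh(H)
        Vk, lk = vecs[:, -k:], vals[-k:]
        L_rest = vals[-k - 1]                     # smoothness of the complement
        g_top = Vk @ ((Vk.T @ g) / lk)            # curvature-scaled step in span(Vk)
        g_rest = (g - Vk @ (Vk.T @ g)) / L_rest   # plain 1/L step in the complement
        x = x - (g_top + g_rest)
    return x

x_hat = spectral_precond_gd(rng.standard_normal(d))
print("f(x_hat) =", f(x_hat))
```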